To use this R package and its functions, you need to prepare a merged gene expression dataset manually. The following workflow demonstrates how to achieve this.
This example uses the GEO lung cancer studies “GSE18842”, “GSE19804” and “GSE19188”, which together contain 367 samples (197 tumor; 170 non-tumor) and 54,675 transcripts measured with the Affymetrix GeneChip Human Genome U133 Plus 2.0 Array (platform GPL570).
Microarray datasets from the GEO database can be downloaded directly by executing R code. Before downloading the GEO studies, the working directory must be initialized and the required folders created. The object targetcol should be set to “target” (corresponding to the column name in the object sampleMetadata below); it names the outcome variable of interest for the subsequent analyses. The levels of targetcol are defined via controlname and targetname. Since in this example we use the package to analyze DEGs and to identify diagnostic and prognostic gene signatures for separating two groups (subtypes, healthy vs. cancer, etc.), we set controlname and targetname to “Control” and “Lung Cancer”.
# initialize filePath:
filePath <- tempdir()
maindir <- "./geodata/"
datadir <- paste0(maindir, "data/")
dir.create(maindir)
dir.create(datadir)
# initialize more infos on the study
targetcol <- "target" # should be named "target"
controlname <- "Control"
targetname <- "Lung Cancer"

Insert the GEO accession numbers of the desired GEO studies into the getGEO() function. This downloads the datasets containing expression, pheno and feature data as well as the probe set annotations, and stores them as ‘.txt.gz’ files in the defined datadir folder.
GSE18842 <- GEOquery::getGEO("GSE18842", destdir = datadir)
GSE19804 <- GEOquery::getGEO("GSE19804", destdir = datadir)
# extracting ExpressionSets
eset1 <- GSE18842[[1]]
eset2 <- GSE19804[[1]]

Moreover, this approach also enables downloading the raw data in CEL file format. These files need to be uncompressed; subsequently, a GCRMA normalization has to be conducted.
For this, further commands are necessary to import the CEL files into the R environment. esetC represents a normalized ExpressionSet, though it only contains expression data without pheno or feature data; these can be added later. (The study GSE19188 could also have been downloaded directly via getGEO() instead of using the raw data and manually adding pheno and feature data afterwards. We merely wanted to demonstrate how to handle raw data in the form of CEL files and apply normalization to it.)
# download raw data, uncompress them and execute GCRMA normalization
# actually not necessary for GSE19188 but to showcase feasibility of dealing with .CEL files
GEOquery::getGEOSuppFiles("GSE19188", baseDir = filePath)
utils::untar(paste0(filePath, "/GSE19188/GSE19188_RAW.tar"), exdir=maindir)
cels <- list.files(maindir, pattern = "\\.gz$")
tryCatch({
sapply(paste(maindir, cels, sep="/"), GEOquery::gunzip)
# at this point, first the text file "phenodata.txt" must be generated
## For downloading raw data use the function: getGEOSuppFiles("").
## After downloading and uncompressing the raw file (*_RAW.tar) we need to capture
## the experimental information. Therefore a text file is created in the terminal
## window or rather in the command line.
system("bash -c 'ls ./geodata/*.CEL > ./geodata/phenodata.txt'")
## The command ls ./geodata/*.CEL > ./geodata/phenodata.txt puts all .CEL files in one text file.
## After this step the normalization can be started.
}, error = function(e){
print(e)
})
cels <- list.files(maindir, pattern = "CEL$")
celfiles <- paste0(maindir, cels)
# use justGCRMA now due to memory issues with gcrma --> justGCRMA is much more memory efficient
# https://www.rdocumentation.org/packages/gcrma/versions/2.44.0/topics/justGCRMA
esetC <- gcrma::justGCRMA(filenames = celfiles, fast = TRUE)
# call garbage collection here
gc()
# old code
#celfiles <- affy::ReadAffy(filenames = gsub("\\.gz", "", cels), celfile.path = "./geodata")
#esetC <- gcrma::gcrma(celfiles, fast = TRUE)
# esetC = now is a normalized ExpressionSet, but without f- and pData
# the f- and pData are transferred manually from dataset GSE19188
GSE19188 <- GEOquery::getGEO("GSE19188", destdir = datadir)
eset3 <- GSE19188[[1]]
pData3 <- Biobase::pData(eset3)
fData3 <- Biobase::fData(eset3)
Biobase::pData(esetC) <- pData3
Biobase::fData(esetC) <- fData3
colnames(Biobase::exprs(esetC)) <- colnames(Biobase::exprs(eset3))
Biobase::annotation(esetC) <- Biobase::annotation(eset3)
eset3 <- esetC

The phenoData of the ExpressionSets need to be formatted so that the phenoData columns overlap perfectly, i.e. all ExpressionSets should contain identical columns in the same order. Inconsistent columns need to be removed. For a simpler procedure, we recommend retaining only the phenoData columns required for the downstream analysis (e.g. targetcol and survival data).
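Before deleting columns by hand, it can help to list which phenoData column names are not shared between two ExpressionSets. A minimal base-R sketch, using illustrative column vectors in place of the real colnames(Biobase::pData(...)) calls:

```r
# illustrative column names standing in for two studies' phenoData
cols1 <- c("title", "geo_accession", "sample type:ch1")
cols2 <- c("title", "geo_accession", "stage:ch1")
# columns present in only one of the two studies, i.e. candidates for removal
setdiff(union(cols1, cols2), intersect(cols1, cols2))
```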
# create version b, that original ExpressionSets persist
eset1b <- eset1
eset2b <- eset2
eset3b <- eset3
# show number of samples and phenoData columns
dim(Biobase::pData(eset1b))
#> [1] 91 33
dim(Biobase::pData(eset2b))
#> [1] 120 37
dim(Biobase::pData(eset3b))
#> [1] 156 44
# depict phenoData column names
colnames(Biobase::pData(eset1b))
#> [1] "title" "geo_accession"
#> [3] "status" "submission_date"
#> [5] "last_update_date" "type"
#> [7] "channel_count" "source_name_ch1"
#> [9] "organism_ch1" "characteristics_ch1"
#> [11] "characteristics_ch1.1" "molecule_ch1"
#> [13] "extract_protocol_ch1" "label_ch1"
#> [15] "label_protocol_ch1" "taxid_ch1"
#> [17] "hyb_protocol" "scan_protocol"
#> [19] "description" "data_processing"
#> [21] "platform_id" "contact_name"
#> [23] "contact_email" "contact_institute"
#> [25] "contact_address" "contact_city"
#> [27] "contact_zip/postal_code" "contact_country"
#> [29] "supplementary_file" "data_row_count"
#> [31] "relation" "sample type:ch1"
#> [33] "tissue:ch1"
colnames(Biobase::pData(eset2b))
#> [1] "title" "geo_accession"
#> [3] "status" "submission_date"
#> [5] "last_update_date" "type"
#> [7] "channel_count" "source_name_ch1"
#> [9] "organism_ch1" "characteristics_ch1"
#> [11] "characteristics_ch1.1" "characteristics_ch1.2"
#> [13] "characteristics_ch1.3" "molecule_ch1"
#> [15] "extract_protocol_ch1" "label_ch1"
#> [17] "label_protocol_ch1" "taxid_ch1"
#> [19] "hyb_protocol" "scan_protocol"
#> [21] "description" "data_processing"
#> [23] "platform_id" "contact_name"
#> [25] "contact_email" "contact_department"
#> [27] "contact_institute" "contact_address"
#> [29] "contact_city" "contact_zip/postal_code"
#> [31] "contact_country" "supplementary_file"
#> [33] "data_row_count" "age:ch1"
#> [35] "gender:ch1" "stage:ch1"
#> [37] "tissue:ch1"
colnames(Biobase::pData(eset3b))
#> [1] "title" "geo_accession"
#> [3] "status" "submission_date"
#> [5] "last_update_date" "type"
#> [7] "channel_count" "source_name_ch1"
#> [9] "organism_ch1" "characteristics_ch1"
#> [11] "characteristics_ch1.1" "characteristics_ch1.2"
#> [13] "characteristics_ch1.3" "characteristics_ch1.4"
#> [15] "molecule_ch1" "extract_protocol_ch1"
#> [17] "label_ch1" "label_protocol_ch1"
#> [19] "taxid_ch1" "hyb_protocol"
#> [21] "scan_protocol" "description"
#> [23] "data_processing" "platform_id"
#> [25] "contact_name" "contact_email"
#> [27] "contact_phone" "contact_fax"
#> [29] "contact_department" "contact_institute"
#> [31] "contact_address" "contact_city"
#> [33] "contact_zip/postal_code" "contact_country"
#> [35] "contact_web_link" "supplementary_file"
#> [37] "data_row_count" "relation"
#> [39] "relation.1" "cell type:ch1"
#> [41] "gender:ch1" "overall survival:ch1"
#> [43] "status:ch1" "tissue type:ch1"
# remove unnecessary phenoData
# rename desired column names and if necessary the column entries (as we did for 'Tumor')
# eset1b
eset1b$relation <- NULL
eset1b$characteristics_ch1 <- NULL
eset1b$`sample type:ch1` <- NULL
eset1b$`tissue:ch1` <- NULL
colnames(Biobase::pData(eset1b))[8] = targetcol
levels(eset1b[[targetcol]]) = c(controlname, targetname)
dim(eset1b)
#> Features Samples
#> 54675 91
# eset2b
eset2b$characteristics_ch1.2 <- NULL
eset2b$characteristics_ch1.3 <- NULL
eset2b$contact_department <- NULL
eset2b$characteristics_ch1 <- NULL
eset2b$`age:ch1` <- NULL
eset2b$`gender:ch1` <- NULL
eset2b$`stage:ch1` <- NULL
eset2b$`tissue:ch1` <- NULL
colnames(Biobase::pData(eset2b))[8] = targetcol
levels(eset2b[[targetcol]]) = c(controlname, targetname)
dim(eset2b)
#> Features Samples
#> 54675 120
# eset3b
eset3b$characteristics_ch1.2 <- NULL
eset3b$characteristics_ch1.3 <- NULL
eset3b$characteristics_ch1.4 <- NULL
eset3b$contact_phone <- NULL
eset3b$contact_fax <- NULL
eset3b$contact_department <- NULL
eset3b$contact_web_link <- NULL
eset3b$relation <- NULL
eset3b$source_name_ch1 <- NULL
colnames(Biobase::pData(eset3b))[9] = targetcol
levels(eset3b[[targetcol]]) = c(controlname, targetname)
pNew <- Biobase::pData(eset3b)[,c(1:7,9,8,10:29)] # reorder columns
Biobase::pData(eset3b) = pNew
dim(eset3b)
#> Features Samples
#> 54675 156
# now, all phenoData columns are consistent
cbind(colnames(Biobase::pData(eset1b)), colnames(Biobase::pData(eset2b)),
colnames(Biobase::pData(eset3b)))
#> [,1] [,2]
#> [1,] "title" "title"
#> [2,] "geo_accession" "geo_accession"
#> [3,] "status" "status"
#> [4,] "submission_date" "submission_date"
#> [5,] "last_update_date" "last_update_date"
#> [6,] "type" "type"
#> [7,] "channel_count" "channel_count"
#> [8,] "target" "target"
#> [9,] "organism_ch1" "organism_ch1"
#> [10,] "characteristics_ch1.1" "characteristics_ch1.1"
#> [11,] "molecule_ch1" "molecule_ch1"
#> [12,] "extract_protocol_ch1" "extract_protocol_ch1"
#> [13,] "label_ch1" "label_ch1"
#> [14,] "label_protocol_ch1" "label_protocol_ch1"
#> [15,] "taxid_ch1" "taxid_ch1"
#> [16,] "hyb_protocol" "hyb_protocol"
#> [17,] "scan_protocol" "scan_protocol"
#> [18,] "description" "description"
#> [19,] "data_processing" "data_processing"
#> [20,] "platform_id" "platform_id"
#> [21,] "contact_name" "contact_name"
#> [22,] "contact_email" "contact_email"
#> [23,] "contact_institute" "contact_institute"
#> [24,] "contact_address" "contact_address"
#> [25,] "contact_city" "contact_city"
#> [26,] "contact_zip/postal_code" "contact_zip/postal_code"
#> [27,] "contact_country" "contact_country"
#> [28,] "supplementary_file" "supplementary_file"
#> [29,] "data_row_count" "data_row_count"
#> [,3]
#> [1,] "title"
#> [2,] "geo_accession"
#> [3,] "status"
#> [4,] "submission_date"
#> [5,] "last_update_date"
#> [6,] "type"
#> [7,] "channel_count"
#> [8,] "target"
#> [9,] "organism_ch1"
#> [10,] "characteristics_ch1.1"
#> [11,] "molecule_ch1"
#> [12,] "extract_protocol_ch1"
#> [13,] "label_ch1"
#> [14,] "label_protocol_ch1"
#> [15,] "taxid_ch1"
#> [16,] "hyb_protocol"
#> [17,] "scan_protocol"
#> [18,] "description"
#> [19,] "data_processing"
#> [20,] "platform_id"
#> [21,] "contact_name"
#> [22,] "contact_email"
#> [23,] "contact_institute"
#> [24,] "contact_address"
#> [25,] "contact_city"
#> [26,] "contact_zip/postal_code"
#> [27,] "contact_country"
#> [28,] "supplementary_file"
#> [29,] "data_row_count"

Before using the sigident package, the following variables need to be defined properly. One could also pass them as arguments directly to the respective functions; however, to keep the code clearer, we define them here at the beginning and refer to the respective variable in each function call. idtype can be either “affy” or “entrez”, indicating whether Affy IDs or Entrez IDs are to be used as feature names in the subsequent analyses.
Selecting “entrez” usually results in a reduction of the total number of features due to the removal of empty strings and duplicated IDs.
We here use idtype = “affy” for all subsequent analyses.
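The reduction can be illustrated with a toy Entrez ID vector; the filtering shown here is an assumption about how empty and duplicated IDs are handled, not the package's exact implementation:

```r
# toy Entrez IDs containing an empty string and a duplicate
entrez <- c("170575", "", "54443", "54443", "1829")
keep <- entrez != "" & !duplicated(entrez)
entrez[keep]  # 3 of the 5 features remain
```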
# general
plotdir <- "./plots/"
dir.create(plotdir)
csvdir <- "./csv/"
dir.create(csvdir)
idtype = "affy"
# enrichment analysis
species <- "Hs"
OrgDb <- "org.Hs.eg.db"
organism <- "hsa"
#pathwayid <- "hsa04110" # Cell Cycle
pathwayid <- "hsa04151" # PI3K-Akt
# diagnostic signature
seed <- 111
split <- 0.8
nfolds <- 10

The metadata are extracted directly from the ExpressionSets and, for easier applicability, deposited in the data frame sampleMetadata. This data frame consists of the GEO accession numbers of the studies (GSExxx), the sample accession numbers (GSMxxx) and the targetcol (named “target”) containing the outcome of interest (currently, only a binary coded target variable is supported for the subsequent analyses, e.g. “Cancer - No Cancer”).
# workaround
GSE18842_meta <- data.table::data.table(
cbind(
study = "GSE18842",
sample = eset1b@phenoData@data[,"geo_accession"],
target = as.character(eset1b@phenoData@data[,targetcol])
)
)
GSE19804_meta <- data.table::data.table(
cbind(
study = "GSE19804",
sample = eset2b@phenoData@data[,"geo_accession"],
target = as.character(eset2b@phenoData@data[,targetcol])
)
)
GSE19188_meta <- data.table::data.table(
cbind(
study = "GSE19188",
sample = eset3b@phenoData@data[,"geo_accession"],
target = as.character(eset3b@phenoData@data[,targetcol])
)
)
sampleMetadata <- rbind(GSE18842_meta,
GSE19804_meta,
GSE19188_meta)

studyMetadata contains the GEO series accession numbers of the studies and their assignment as discovery or validation dataset.
In this framework, merging preserves only genes that are present in all datasets. In our example, the entire probe set of 54,675 transcripts remains. mergedset contains the samples of the studies grouped as discovery set and represents an ExpressionSet object describing samples, annotations and further information about the features.
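The merging principle can be sketched with base R; the probe IDs below are illustrative:

```r
# merging keeps only the features present in all datasets
feat1 <- c("1007_s_at", "1053_at", "117_at")
feat2 <- c("1007_s_at", "117_at", "121_at")
Reduce(intersect, list(feat1, feat2))
```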
Batch effects are systematic non-biological variation between studies due to experimental and technical artifacts. As a first visualization, a boxplot is created with the included samples on the x-axis and the expression values on the y-axis. Thereby, considerable discrepancies between the three studies can already be recognized.
# visualize log2 transformed expression values of GSE18842 as boxplot
filename <- paste0(plotdir, "boxplot_GSE18842.jpeg")
jpeg(filename, width = 2000, height = 1000, res = 150, units = "px")
boxplot(Biobase::exprs(eset1b), main = "GSE18842 before batch correction",
xlab = "Samples", ylab ="Expression value")
dev.off()

# visualize log2 transformed expression values of GSE19804 as boxplot
filename <- paste0(plotdir, "boxplot_GSE19804.jpeg")
jpeg(filename, width = 2000, height = 1000, res = 150, units = "px")
boxplot(Biobase::exprs(eset2b), main = "GSE19804 before batch correction",
xlab = "Samples", ylab ="Expression value")
dev.off()

# visualize log2 transformed expression values of GSE19188 as boxplot
filename <- paste0(plotdir, "boxplot_GSE19188.jpeg")
jpeg(filename, width = 2000, height = 1000, res = 150, units = "px")
boxplot(Biobase::exprs(eset3b), main = "GSE19188 before batch correction",
xlab = "Samples", ylab ="Expression value")
dev.off()

# visualize log2 transformed expression values of the merged dataset as boxplot
filename <- paste0(plotdir, "boxplot_merged_data.jpeg")
jpeg(filename, width = 2000, height = 1000, res = 150, units = "px")
boxplot(mergedset@assayData$exprs, main = "Merged data before batch correction",
xlab = "Samples", ylab ="Expression value")
dev.off()

A more powerful approach for batch effect detection is conducting a guided Principal Component Analysis (gPCA), implemented in the gPCA package.
# conducting gPCA for batch effect detection
DF <- mergedset@assayData$exprs
# define batches with number of samples
batch <- sigident::createBatch_(studyMetadata = studyMetadata,
sampleMetadata = sampleMetadata)
gPCA_before <- sigident::batchDetection_(mergeset = DF,
batch = batch)
filename <- paste0(plotdir, "PCplot_before.png")
sigident::createBatchPlot_(correction_obj = gPCA_before,
filename = filename,
time = "before")

In order to correct for batch effects and other unwanted variation in high-throughput experiments, the ComBat function from the sva package is applied inside the function batchCorrection_(). ComBat adjusts for known batches using an empirical Bayesian framework [1].
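For orientation, a self-contained sketch of such a ComBat adjustment on simulated data (the real call happens inside batchCorrection_(); matrix sizes, batches and design below are illustrative):

```r
library(sva)  # Bioconductor package providing ComBat

set.seed(1)
expr  <- matrix(rnorm(200 * 30), nrow = 200)  # genes x samples
batch <- rep(1:3, times = c(10, 10, 10))      # three simulated studies
mod   <- model.matrix(~ rep(0:1, 15))         # toy outcome design
corrected <- ComBat(dat = expr, batch = batch, mod = mod)
dim(corrected)  # dimensions are unchanged by the correction
```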
table(mergedset@phenoData@data[[targetcol]])
#>
#> Control Lung Cancer
#> 170 197
# generate list dd with diagnosis and design
dd <- sigident::createDiagnosisDesignBatch_(sampleMetadata = sampleMetadata,
studyMetadata = studyMetadata,
controlname = controlname,
targetname = targetname,
targetcol = targetcol)
diagnosis <- dd$diagnosis
length(diagnosis)
#> [1] 367
table(diagnosis)
#> diagnosis
#> 0 1
#> 170 197
design <- dd$design
# check batch assignment
# (batch is also created during createDiagnosisDesignBatch_())
batch <- dd$batch
table(batch)
#> batch
#> 1 2 3
#> 91 120 156
length(batch)
#> [1] 367
# remove batch effects by computing linear models, taking the variance between the diagnosis groups into consideration
mergeset <- sigident::batchCorrection_(mergedset = mergedset,
batch = batch,
design = design,
idtype = idtype)
#> Standardizing Data across genes
dim(mergeset)
#> [1] 54675 367

# remove unneeded objects
rm("DF", "GSE18842_meta", "GSE19188_meta", "GSE19804_meta")
# also no further purpose, but takes a lot of memory and may slow down your computer
rm("eset1", "eset1b", "eset2", "eset2b", "eset3", "eset3b", "GSE18842", "GSE19188", "GSE19804", "fData3", "esetC")
gc()

mergeset is the output of the merging approach described above: a matrix containing batch-corrected expression data with genes in the rows and samples in the columns, no longer an ExpressionSet.
Creating a boxplot and a gPCA plot with mergeset shows that no batch effect is present anymore.
filename <- paste0(plotdir, "import_boxplot.png")
sigident::createImportBoxplot_(mergeset, filename)

gPCA_after <- sigident::batchDetection_(mergeset = mergeset,
batch = batch)
filename <- paste0(plotdir, "PCplot_after.png")
sigident::createBatchPlot_(correction_obj = gPCA_after,
filename = filename,
time = "after")

A common task preceding expression data analysis is statistical testing to discover quantitative changes in expression levels between experimental groups. For this purpose, we offer the identification of differentially expressed genes (DEGs) based on the limma package. To correct for multiple testing, p-values are adjusted with the “BH” method, which controls the expected false discovery rate (FDR).
A heatmap based on the selected DEGs is created, in which clear differences in the expression profiles between the groups (“Control” [blue] and “Lung Cancer” [red]) can be recognized.
# define the false discovery rate here
FDR <- 0.05
genes <- sigident::identifyDEGs_(mergeset = mergeset,
design = design,
qValue = FDR)
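The wrapper above presumably builds on standard limma calls; a self-contained sketch on simulated data (coef = 2 assumes the diagnosis is coded in the second design column):

```r
library(limma)

set.seed(1)
expr   <- matrix(rnorm(1000 * 20), nrow = 1000)  # genes x samples
group  <- rep(0:1, each = 10)
design <- model.matrix(~ group)
fit <- eBayes(lmFit(expr, design))
# genes passing the BH-adjusted significance threshold
tt <- topTable(fit, coef = 2, adjust.method = "BH", number = Inf, p.value = 0.05)
nrow(tt)  # with pure noise, usually no gene passes
```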
# heatmap creation
filename <- paste0(plotdir, "DEG_heatmap.png")
# create colors for map
ht_colors <- sigident::colorHeatmap_(sampleMetadata = sampleMetadata,
studyMetadata = studyMetadata,
targetcol = targetcol,
controlname = controlname) # cancer = red
sigident::createDEGheatmap_(mergeset = mergeset,
genes = genes,
patientcolors = ht_colors,
filename = filename)

Some investigations may be finished at this point, having yielded the DEGs. We provide two tables containing annotations such as gene symbols (DEG_info.csv) and the results of the limma analysis, including log2 fold changes and adjusted p-values (DEG_results.csv).
deg_info <- sigident::exportDEGannotations_(mergedset = mergedset,
genes = genes,
idtype = idtype)
data.table::fwrite(deg_info, paste0(csvdir, "DEG_info.csv"))

| probe_ID | gene_symbol | gene_title | genebank_accession | entrez_id |
|---|---|---|---|---|
| 1552318_at | GIMAP1 | GTPase, IMAP family member 1 | AK091818 | 170575 |
| 1552509_a_at | CD300LG | CD300 molecule-like family member g | NM_145273 | 146894 |
| 1552619_a_at | ANLN | anillin, actin binding protein | NM_018685 | 54443 |
| 1552767_a_at | HS6ST2 | heparan sulfate 6-O-sulfotransferase 2 | NM_147174 | 90161 |
| 1552797_s_at | PROM2 | prominin 2 | NM_144707 | 150696 |
| 1553105_s_at | DSG2 | desmoglein 2 | NM_001943 | 1829 |
deg_results <- sigident::limmaTopTable_(mergeset = mergeset,
design = design,
qValue = FDR)
data.table::fwrite(deg_results, paste0(csvdir, "DEG_results.csv"))

| | Probe ID | logFC | adj.qValue |
|---|---|---|---|
| 217046_s_at | 217046_s_at | -4.210404 | 0 |
| 210081_at | 210081_at | -5.357901 | 0 |
| 209470_s_at | 209470_s_at | -4.554046 | 0 |
| 209469_at | 209469_at | -3.945938 | 0 |
| 206209_s_at | 206209_s_at | -4.396728 | 0 |
| 209904_at | 209904_at | -4.090455 | 0 |
A common investigation following differential expression analysis is the functional annotation of the DEGs. Furthermore, it is useful to find out whether the DEGs are associated with biological processes or molecular functions. The functional analysis supports GO annotation from an OrgDb object.
For the identification of enriched GO terms and KEGG pathways, an over-representation analysis is performed based on linear models. As input, the Entrez IDs and the definition of the species are needed. extractGOterms_ and extractKEGGterms_ output tables containing the most significant GO terms and KEGG terms, respectively. In either case, rows are sorted by the minimum p-value, with
- Term: GO term
- (Pathway: KEGG pathway)
- Ont: ontology that GO term belongs to
- N: number of genes in the GO term
- DE: number of genes in the input Entrez Gene IDs (deg_entrez)
- P.DE: p-value for over-representation of the GO term in the set
enr_topgo <- sigident::extractGOterms_(gene = deg_entrez,
species = species)
data.table::fwrite(enr_topgo, paste0(csvdir, "Top_GO.csv"))

| | Term | Ont | N | DE | P.DE |
|---|---|---|---|---|---|
| GO:0044444 | cytoplasmic part | CC | 9299 | 8712 | 0 |
| GO:0005737 | cytoplasm | CC | 10896 | 10117 | 0 |
| GO:0005515 | protein binding | MF | 11329 | 10693 | 0 |
| GO:0005488 | binding | MF | 14625 | 13242 | 0 |
| GO:0003674 | molecular_function | MF | 16969 | 14960 | 0 |
| GO:0005622 | intracellular | CC | 13992 | 12445 | 0 |
enr_topkegg <- sigident::extractKEGGterms_(gene = deg_entrez,
species = species)
data.table::fwrite(enr_topkegg, paste0(csvdir, "Top_KEGG.csv"))

| | Pathway | N | DE | P.DE |
|---|---|---|---|---|
| path:hsa05200 | Pathways in cancer | 530 | 514 | 0 |
| path:hsa01100 | Metabolic pathways | 1439 | 1333 | 0 |
| path:hsa04151 | PI3K-Akt signaling pathway | 354 | 341 | 0 |
| path:hsa04360 | Axon guidance | 181 | 179 | 0 |
| path:hsa05165 | Human papillomavirus infection | 330 | 318 | 0 |
| path:hsa04810 | Regulation of actin cytoskeleton | 213 | 209 | 0 |
The following test for over-representation is based on the same method, but additionally accounts for over-expression and under-expression within the enriched terms, with
- Term: GO term
- Ont: ontology that GO term belongs to
- N: number of genes in the GO term
- Up: number of up-regulated DEGs
- Down: number of down-regulated DEGs
- P.Up: p-value for over-representation of GO term in up-regulated genes
- P.Down: p-value for over-representation of GO term in down-regulated genes
enr_fitlm <- sigident::goDiffReg_(mergeset = mergeset,
idtype = idtype,
design = design,
entrezids = mergedset@featureData@data$ENTREZ_GENE_ID)
enr_fitlm_topgo <- sigident::extractGOterms_(gene = enr_fitlm,
species = species,
FDR = 0.01)
data.table::fwrite(enr_fitlm_topgo, paste0(csvdir, "Top_GO_fitlm.csv"))

| | Term | Ont | N | Up | Down | P.Up | P.Down |
|---|---|---|---|---|---|---|---|
| GO:0043231 | intracellular membrane-bounded organelle | CC | 9294 | 4652 | 3261 | 0 | 0 |
| GO:0005622 | intracellular | CC | 12445 | 5898 | 4493 | 0 | 0 |
| GO:0044424 | intracellular part | CC | 12445 | 5898 | 4493 | 0 | 0 |
| GO:0070013 | intracellular organelle lumen | CC | 4844 | 2715 | 1678 | 0 | 0 |
| GO:0031974 | membrane-enclosed lumen | CC | 4844 | 2715 | 1678 | 0 | 0 |
| GO:0043233 | organelle lumen | CC | 4844 | 2715 | 1678 | 0 | 0 |
enr_fitlm_topkegg <- sigident::extractKEGGterms_(gene = enr_fitlm,
species = species)
data.table::fwrite(enr_fitlm_topkegg, paste0(csvdir, "Top_KEGG_fitlm.csv"))

| | Pathway | N | Up | Down | P.Up | P.Down |
|---|---|---|---|---|---|---|
| path:hsa05200 | Pathways in cancer | 514 | 230 | 258 | 0.6191996 | 0.0000000 |
| path:hsa03013 | RNA transport | 143 | 113 | 41 | 0.0000000 | 0.8632953 |
| path:hsa04380 | Osteoclast differentiation | 126 | 37 | 86 | 0.9999171 | 0.0000000 |
| path:hsa04010 | MAPK signaling pathway | 283 | 113 | 158 | 0.9710721 | 0.0000000 |
| path:hsa04144 | Endocytosis | 230 | 93 | 134 | 0.9410593 | 0.0000000 |
| path:hsa04510 | Focal adhesion | 194 | 85 | 115 | 0.6883347 | 0.0000000 |
GO enrichment analysis of a given set of genes (deg_entrez) for the defined organism is based on the fitted linear models in the object enr_fitlm, the output of the goDiffReg_ function.
goEnrichmentAnalysis_ renders the desired KEGG pathway ID (in this case hsa04151, the PI3K-Akt pathway), including under-expressed (red) and over-expressed (green) genes inside the pathway, as PNG images in the folder named ‘plots’ (object ‘plotdir’).
enr_analysis <- sigident::goEnrichmentAnalysis_(gene = deg_entrez,
OrgDB = OrgDb,
organism = organism,
fitlm = enr_fitlm,
pathwayid = pathwayid,
species = organism,
plotdir = plotdir)

Using createEnrichtedBarplot_, a bar plot is created depicting the enrichment scores (adjusted p-value) as bar color and the gene count as bar height. The parameter type controls whether enriched GO terms or enriched KEGG terms are depicted.
filename <- paste0(plotdir, "Enriched_GO.png")
sigident::createEnrichtedBarplot_(enrichmentobj = enr_analysis$go,
type = "GO",
filename = filename)

filename <- paste0(plotdir, "Enriched_KEGG.png")
sigident::createEnrichtedBarplot_(enrichmentobj = enr_analysis$kegg,
type = "KEGG",
filename = filename)

A common task regarding gene expression data is to distinguish between different groups or subtypes. In order to train a multi-gene classifier, a large amount of expression data is required. Due to the imbalance of features (p) versus observations (n), an appropriate method is needed.
Lasso regression and elastic net regularization are generalized linear modeling approaches that are powerful and versatile for feature selection and well suited to analyzing genomic data with p >> n.
Therefore, we implemented the elastic net regularization method for the identification of a diagnostic multi-gene signature.
A statistical prediction model is usually trained on 70%–80% of the available data. createTrainingTest_ randomly splits the merged dataset into a training and a test set according to the previously defined value split, while keeping the group distribution (here diagnosis) balanced. For reproducibility, a seed must be set.
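A base-R sketch of such a stratified 80/20 split (the diagnosis vector is rebuilt to match this example's group sizes; createTrainingTest_ presumably works similarly internally):

```r
set.seed(111)
diagnosis <- rep(c(0, 1), times = c(170, 197))  # 170 controls, 197 tumors
# sample 80% within each class so the split stays balanced
train_idx <- unlist(lapply(split(seq_along(diagnosis), diagnosis),
                           function(i) sample(i, floor(0.8 * length(i)))))
table(diagnosis[train_idx])   # training set
table(diagnosis[-train_idx])  # test set
```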
Regularization requires the selection of a tuning parameter (lambda) that determines the strength of regularization. The function signature_ performs an nfolds-fold cross-validation to select the best lambda and additionally builds the predictive model; seed is again set for reproducibility. The plot depicts the mean squared error as the criterion of the 10-fold cross-validation and shows the optimal values of lambda and the corresponding number of selected features (genes) above the plot. The left dashed vertical line represents the log of the optimal lambda (‘lambda min’), which yields the lowest prediction error and thus the most accurate model. The right dashed vertical line indicates the log of the lambda value that results in the model with the smallest number of included variables while still lying within one standard error of the optimal lambda (‘lambda 1se’).
Furthermore, a ROC curve is plotted to visualize the performance measurement of the model.
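Under the hood this corresponds to a cross-validated glmnet fit; a self-contained sketch on simulated data (the real call uses the training split, and for the lasso alpha = 1):

```r
library(glmnet)

set.seed(111)
x <- matrix(rnorm(100 * 50), nrow = 100)  # samples x genes
y <- rbinom(100, 1, 0.5)                  # binary diagnosis
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)
c(cvfit$lambda.min, cvfit$lambda.1se)     # the two dashed lines in the CV plot
sum(coef(cvfit, s = "lambda.min") != 0)   # selected features (plus intercept)
```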
diagnostic_lasso <- sigident::signature_(traininglist = training_list,
type = "lasso",
nfolds = nfolds,
seed = seed)
filename <- paste0(plotdir, "CV_lasso.png")
sigident::createCVPlot_(cv_obj = diagnostic_lasso$fitCV,
filename = filename)

filename <- paste0(plotdir, "ROC_Lasso.min.png")
sigident::createROCplot_(roc = diagnostic_lasso$roc.min,
filename = filename)

filename <- paste0(plotdir, "ROC_Lasso.1se.png")
sigident::createROCplot_(roc = diagnostic_lasso$roc.1se,
filename = filename)

The elastic net is a mixture of lasso and ridge regularization: it shrinks coefficients (as in ridge regression) and sets some of them to zero (as in lasso). The mixing parameter alpha therefore needs to be set to a value between 1 (lasso) and 0 (ridge). A grid search for optimal alpha and lambda can be conducted (as done next) but may require substantial computing capacity; hence we propose setting alpha manually to 0.9.
diagnostic_elasticnet <- sigident::signature_(traininglist = training_list,
type = "elastic",
alpha = 0.9,
nfolds = nfolds,
seed = seed)
filename <- paste0(plotdir, "CV_elasticNet.png")
sigident::createCVPlot_(cv_obj = diagnostic_elasticnet$fitCV,
filename = filename)

filename <- paste0(plotdir, "ROC_elasticNet.min.png")
sigident::createROCplot_(roc = diagnostic_elasticnet$roc.min,
filename = filename)

filename <- paste0(plotdir, "ROC_elasticNet.1se.png")
sigident::createROCplot_(roc = diagnostic_elasticnet$roc.1se,
filename = filename)

Setting type = "grid" in the function signature_ conducts a grid search using the R package caret, calculating the pair of alpha and lambda for the ‘glmnet’ method that minimizes the CV error.
diagnostic_glmGrid <- sigident::signature_(traininglist = training_list,
type = "grid",
nfolds = nfolds,
seed = seed)
# plot model of gridsearch
filename <- paste0(plotdir, "Gridsearch_model.png")
sigident::createGridModelPlot_(model = diagnostic_glmGrid$caret.train,
filename = filename)

# plot variable importance of gridsearch
filename <- paste0(plotdir, "Gridsearch_variable_importance.png")
sigident::createGridVarImpPlot_(model = diagnostic_glmGrid$caret.train,
filename = filename)

# create roc plot
filename <- paste0(plotdir, "ROC_elasticNet.grid.png")
sigident::createROCplot_(roc = diagnostic_glmGrid$roc.elasticNet,
filename = filename)

These models can be collected in a list, and the classification performance metrics can then be printed using the function compareDiagnosticModels(). This list contains all information on the respective models, which can be accessed via the following structure.
diagnosticModels <- list(
"lasso" = list("CV" = diagnostic_lasso$fitCV,
"min" = list("model" = diagnostic_lasso$lambda.min,
"confmat" = diagnostic_lasso$confmat.min,
"prediction" = diagnostic_lasso$predicted.min,
"auc" = as.numeric(diagnostic_lasso$roc.min$auc)),
"1se" = list("model" = diagnostic_lasso$lambda.1se,
"confmat" = diagnostic_lasso$confmat.1se,
"prediction" = diagnostic_lasso$predicted.1se,
"auc" = as.numeric(diagnostic_lasso$roc.1se$auc))
),
"elasticnet" = list("CV" = diagnostic_elasticnet$fitCV,
"min" = list("model" = diagnostic_elasticnet$lambda.min,
"confmat" = diagnostic_elasticnet$confmat.min,
"prediction" = diagnostic_elasticnet$predicted.min,
"auc" = as.numeric(diagnostic_elasticnet$roc.min$auc)),
"1se" = list("model" = diagnostic_elasticnet$lambda.1se,
"confmat" = diagnostic_elasticnet$confmat.1se,
"prediction" = diagnostic_elasticnet$predicted.1se,
"auc" = as.numeric(diagnostic_elasticnet$roc.1se$auc))
),
"grid" = list("CV" = diagnostic_glmGrid$caret.train,
"model" = diagnostic_glmGrid$elasticNet.auto,
"confmat" = diagnostic_glmGrid$confmat.elasticNet,
"prediction" = diagnostic_glmGrid$predicted.elasticNet,
"auc" = as.numeric(diagnostic_glmGrid$roc.elasticNet$auc))
)

| Model | AUC |
|---|---|
| lasso.min | 0.981 |
| lasso.1se | 0.983 |
| elasticnet.min | 0.981 |
| elasticnet.1se | 0.983 |
| grid | 0.976 |
A function getDiagnosticLambdaValues was designed to extract the results of the cross-validation.
| Model | Lambda min | Lambda 1se |
|---|---|---|
| lasso | 0.009260 | 0.059521 |
| elasticnet | 0.010288 | 0.063129 |
| grid (best tune, alpha, lambda) | 0.100000 | 0.521401 |
Additionally, the confusion matrix and the calculated performance metrics can be printed for each model.
diagnosticModels$lasso$min$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 32 2
#> 1 2 37
#>
#> Accuracy : 0.9452
#> 95% CI : (0.8656, 0.9849)
#> No Information Rate : 0.5342
#> P-Value [Acc > NIR] : 8.964e-15
#>
#> Kappa : 0.8899
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9487
#> Specificity : 0.9412
#> Pos Pred Value : 0.9487
#> Neg Pred Value : 0.9412
#> Prevalence : 0.5342
#> Detection Rate : 0.5068
#> Detection Prevalence : 0.5342
#> Balanced Accuracy : 0.9449
#>
#> 'Positive' Class : 1
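Each of the printed metrics can be recomputed directly from the 2x2 table; a base-R check with the counts from the matrix above (positive class = 1):

```r
# Counts copied from the confusion matrix above
tab <- matrix(c(32, 2, 2, 37), nrow = 2, byrow = TRUE,
              dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
tp <- tab["1", "1"]; tn <- tab["0", "0"]
fp <- tab["1", "0"]; fn <- tab["0", "1"]
accuracy    <- (tp + tn) / sum(tab)  # 69 / 73 = 0.9452
sensitivity <- tp / (tp + fn)        # 37 / 39 = 0.9487
specificity <- tn / (tn + fp)        # 32 / 34 = 0.9412
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
```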

diagnosticModels$lasso[["1se"]]$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 34 2
#> 1 0 37
#>
#> Accuracy : 0.9726
#> 95% CI : (0.9045, 0.9967)
#> No Information Rate : 0.5342
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9452
#>
#> Mcnemar's Test P-Value : 0.4795
#>
#> Sensitivity : 0.9487
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9444
#> Prevalence : 0.5342
#> Detection Rate : 0.5068
#> Detection Prevalence : 0.5068
#> Balanced Accuracy : 0.9744
#>
#> 'Positive' Class : 1

diagnosticModels$elasticnet$min$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 32 2
#> 1 2 37
#>
#> Accuracy : 0.9452
#> 95% CI : (0.8656, 0.9849)
#> No Information Rate : 0.5342
#> P-Value [Acc > NIR] : 8.964e-15
#>
#> Kappa : 0.8899
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9487
#> Specificity : 0.9412
#> Pos Pred Value : 0.9487
#> Neg Pred Value : 0.9412
#> Prevalence : 0.5342
#> Detection Rate : 0.5068
#> Detection Prevalence : 0.5342
#> Balanced Accuracy : 0.9449
#>
#> 'Positive' Class : 1

diagnosticModels$elasticnet[["1se"]]$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 34 2
#> 1 0 37
#>
#> Accuracy : 0.9726
#> 95% CI : (0.9045, 0.9967)
#> No Information Rate : 0.5342
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9452
#>
#> Mcnemar's Test P-Value : 0.4795
#>
#> Sensitivity : 0.9487
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9444
#> Prevalence : 0.5342
#> Detection Rate : 0.5068
#> Detection Prevalence : 0.5068
#> Balanced Accuracy : 0.9744
#>
#> 'Positive' Class : 1

diagnosticModels$grid$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 32 2
#> 1 2 37
#>
#> Accuracy : 0.9452
#> 95% CI : (0.8656, 0.9849)
#> No Information Rate : 0.5342
#> P-Value [Acc > NIR] : 8.964e-15
#>
#> Kappa : 0.8899
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9487
#> Specificity : 0.9412
#> Pos Pred Value : 0.9487
#> Neg Pred Value : 0.9412
#> Prevalence : 0.5342
#> Detection Rate : 0.5068
#> Detection Prevalence : 0.5342
#> Balanced Accuracy : 0.9449
#>
#> 'Positive' Class : 1

The following function extracts the selected genes (Affy IDs) for each model, which can then be written to a CSV file.
signature_genes <- sigident::geneMapSig_(mergeset = mergeset, model = diagnosticModels$lasso$min$model)
data.table::fwrite(signature_genes, paste0(csvdir, "signature_genes_lasso_min.csv"))

signature_genes <- sigident::geneMapSig_(mergeset = mergeset, model = diagnosticModels$lasso[["1se"]]$model)
data.table::fwrite(signature_genes, paste0(csvdir, "signature_genes_lasso_1se.csv"))

The validation of diagnostic signatures is conducted using expression data from the studies “GSE30219”, “GSE33356” and “GSE102287”. We therefore need to create a list holding meta information about the respective studies.
All models can be provided via the argument ‘models’ of the function validateDiagnosticSignature_, using the previously created list diagnosticModels.
For each validation dataset, you have to create a list that contains the specifications of the validation study. The object must be named validationstudylist. The list structure is as follows:
- studyname: the GEO id of the validation study
- setid: the index of the provided GEO set to be used
- targetcolname: the name of the phenoData column containing the target variable
- controllevelname: the level name of the controls in the column specified in targetcolname
- targetlevelname: the level name of the subjects of interest in targetcolname
validationstudylist <- list(studyname = "GSE30219",
setid = 1,
targetcolname = "source_name_ch1",
controllevelname = "Non Tumoral Lung",
targetlevelname = "Lung Tumour")

GSE30219_validation <- sigident::validateDiagnosticSignature_(
validationstudylist = validationstudylist,
models = diagnosticModels,
genes = genes,
idtype = idtype,
targetname = targetname,
controlname = controlname,
targetcol = targetcol,
datadir = datadir
)
#> Found 1 file(s)
#> GSE30219_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> ID_REF = col_character()
#> )
#> See spec(...) for full column specifications.
#> Using locally cached version of GPL570 found here:
#> ./geodata/data//GPL570.soft
#> Warning: 62 parsing failures.
#> row col expected actual file
#> 54614 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54615 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54616 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54617 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54618 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> ..... ....... .................. ......... ............
#> See problems(...) for more details.
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases

GSE30219_validation$lasso$min$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 13 15
#> 1 1 278
#>
#> Accuracy : 0.9479
#> 95% CI : (0.9167, 0.9699)
#> No Information Rate : 0.9544
#> P-Value [Acc > NIR] : 0.760012
#>
#> Kappa : 0.5944
#>
#> Mcnemar's Test P-Value : 0.001154
#>
#> Sensitivity : 0.9488
#> Specificity : 0.9286
#> Pos Pred Value : 0.9964
#> Neg Pred Value : 0.4643
#> Prevalence : 0.9544
#> Detection Rate : 0.9055
#> Detection Prevalence : 0.9088
#> Balanced Accuracy : 0.9387
#>
#> 'Positive' Class : 1
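Note the low Neg Pred Value (0.4643) despite good sensitivity and specificity: this validation cohort is heavily imbalanced, containing far more tumor than control samples. A base-R check with the counts from the matrix above:

```r
# Counts copied from the GSE30219 confusion matrix above
tab <- matrix(c(13, 15, 1, 278), nrow = 2, byrow = TRUE,
              dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
tp <- tab["1", "1"]; tn <- tab["0", "0"]
fp <- tab["1", "0"]; fn <- tab["0", "1"]
npv        <- tn / (tn + fn)        # 13 / 28 = 0.4643
prevalence <- (tp + fn) / sum(tab)  # 293 / 307 = 0.9544
round(c(npv = npv, prevalence = prevalence), 4)
```

With so few controls, even a handful of missed controls drags the NPV down, which is why the sparser 1se models fare better here.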

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE30219_lasso_min.png")
sigident::createROCplot_(roc = GSE30219_validation$lasso$min$roc,
filename = filename)

GSE30219_validation$lasso[["1se"]]$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 13 2
#> 1 1 291
#>
#> Accuracy : 0.9902
#> 95% CI : (0.9717, 0.998)
#> No Information Rate : 0.9544
#> P-Value [Acc > NIR] : 0.0003852
#>
#> Kappa : 0.8914
#>
#> Mcnemar's Test P-Value : 1.0000000
#>
#> Sensitivity : 0.9932
#> Specificity : 0.9286
#> Pos Pred Value : 0.9966
#> Neg Pred Value : 0.8667
#> Prevalence : 0.9544
#> Detection Rate : 0.9479
#> Detection Prevalence : 0.9511
#> Balanced Accuracy : 0.9609
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE30219_lasso_1se.png")
sigident::createROCplot_(roc = GSE30219_validation$lasso[["1se"]]$roc,
filename = filename)

GSE30219_validation$elasticnet$min$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 13 12
#> 1 1 281
#>
#> Accuracy : 0.9577
#> 95% CI : (0.9287, 0.9773)
#> No Information Rate : 0.9544
#> P-Value [Acc > NIR] : 0.461966
#>
#> Kappa : 0.646
#>
#> Mcnemar's Test P-Value : 0.005546
#>
#> Sensitivity : 0.9590
#> Specificity : 0.9286
#> Pos Pred Value : 0.9965
#> Neg Pred Value : 0.5200
#> Prevalence : 0.9544
#> Detection Rate : 0.9153
#> Detection Prevalence : 0.9186
#> Balanced Accuracy : 0.9438
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE30219_elasticnet_min.png")
sigident::createROCplot_(roc = GSE30219_validation$elasticnet$min$roc,
filename = filename)

GSE30219_validation$elasticnet[["1se"]]$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 13 2
#> 1 1 291
#>
#> Accuracy : 0.9902
#> 95% CI : (0.9717, 0.998)
#> No Information Rate : 0.9544
#> P-Value [Acc > NIR] : 0.0003852
#>
#> Kappa : 0.8914
#>
#> Mcnemar's Test P-Value : 1.0000000
#>
#> Sensitivity : 0.9932
#> Specificity : 0.9286
#> Pos Pred Value : 0.9966
#> Neg Pred Value : 0.8667
#> Prevalence : 0.9544
#> Detection Rate : 0.9479
#> Detection Prevalence : 0.9511
#> Balanced Accuracy : 0.9609
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE30219_elasticnet_1se.png")
sigident::createROCplot_(roc = GSE30219_validation$elasticnet[["1se"]]$roc,
filename = filename)

validationstudylist <- list(studyname = "GSE33356",
setid = 1,
targetcolname = "tissue:ch1",
controllevelname = "paired normal adjacent",
targetlevelname = "lung cancer")

GSE33356_validation <- sigident::validateDiagnosticSignature_(
validationstudylist = validationstudylist,
models = diagnosticModels,
genes = genes,
idtype = idtype,
targetname = targetname,
controlname = controlname,
targetcol = targetcol,
datadir = datadir
)
#> Found 2 file(s)
#> GSE33356-GPL570_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> ID_REF = col_character()
#> )
#> See spec(...) for full column specifications.
#> Using locally cached version of GPL570 found here:
#> ./geodata/data//GPL570.soft
#> Warning: 62 parsing failures.
#> row col expected actual file
#> 54614 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54615 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54616 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54617 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54618 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> ..... ....... .................. ......... ............
#> See problems(...) for more details.
#> GSE33356-GPL6801_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#> .default = col_character()
#> )
#> See spec(...) for full column specifications.
#> File stored at:
#> ./geodata/data//GPL6801.soft
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases

GSE33356_validation$lasso$min$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 60 2
#> 1 0 58
#>
#> Accuracy : 0.9833
#> 95% CI : (0.9411, 0.998)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9667
#>
#> Mcnemar's Test P-Value : 0.4795
#>
#> Sensitivity : 0.9667
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9677
#> Prevalence : 0.5000
#> Detection Rate : 0.4833
#> Detection Prevalence : 0.4833
#> Balanced Accuracy : 0.9833
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE33356_lasso_min.png")
sigident::createROCplot_(roc = GSE33356_validation$lasso$min$roc,
filename = filename)

GSE33356_validation$lasso[["1se"]]$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 58 2
#> 1 2 58
#>
#> Accuracy : 0.9667
#> 95% CI : (0.9169, 0.9908)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9333
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9667
#> Specificity : 0.9667
#> Pos Pred Value : 0.9667
#> Neg Pred Value : 0.9667
#> Prevalence : 0.5000
#> Detection Rate : 0.4833
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.9667
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE33356_lasso_1se.png")
sigident::createROCplot_(roc = GSE33356_validation$lasso[["1se"]]$roc,
filename = filename)

GSE33356_validation$elasticnet$min$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 60 5
#> 1 0 55
#>
#> Accuracy : 0.9583
#> 95% CI : (0.9054, 0.9863)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.9167
#>
#> Mcnemar's Test P-Value : 0.07364
#>
#> Sensitivity : 0.9167
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9231
#> Prevalence : 0.5000
#> Detection Rate : 0.4583
#> Detection Prevalence : 0.4583
#> Balanced Accuracy : 0.9583
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE33356_elasticnet_min.png")
sigident::createROCplot_(roc = GSE33356_validation$elasticnet$min$roc,
filename = filename)

GSE33356_validation$elasticnet[["1se"]]$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 58 2
#> 1 2 58
#>
#> Accuracy : 0.9667
#> 95% CI : (0.9169, 0.9908)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9333
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9667
#> Specificity : 0.9667
#> Pos Pred Value : 0.9667
#> Neg Pred Value : 0.9667
#> Prevalence : 0.5000
#> Detection Rate : 0.4833
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.9667
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE33356_elasticnet_1se.png")
sigident::createROCplot_(roc = GSE33356_validation$elasticnet[["1se"]]$roc,
filename = filename)

validationstudylist <- list(studyname = "GSE102287",
setid = 2,
targetcolname = "tumor_normal status:ch1",
controllevelname = "N",
targetlevelname = "T")

GSE102287_validation <- sigident::validateDiagnosticSignature_(
validationstudylist = validationstudylist,
models = diagnosticModels,
genes = genes,
idtype = idtype,
targetname = targetname,
controlname = controlname,
targetcol = targetcol,
datadir = datadir
)
#> Found 2 file(s)
#> GSE102287-GPL23871_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> ID_REF = col_character()
#> )
#> See spec(...) for full column specifications.
#> File stored at:
#> ./geodata/data//GPL23871.soft
#> GSE102287-GPL570_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> ID_REF = col_character()
#> )
#> See spec(...) for full column specifications.
#> Using locally cached version of GPL570 found here:
#> ./geodata/data//GPL570.soft
#> Warning: 62 parsing failures.
#> row col expected actual file
#> 54614 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54615 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54616 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54617 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54618 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> ..... ....... .................. ......... ............
#> See problems(...) for more details.
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases

GSE102287_validation$lasso$min$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 32 1
#> 1 2 31
#>
#> Accuracy : 0.9545
#> 95% CI : (0.8729, 0.9905)
#> No Information Rate : 0.5152
#> P-Value [Acc > NIR] : 3.899e-15
#>
#> Kappa : 0.9091
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9688
#> Specificity : 0.9412
#> Pos Pred Value : 0.9394
#> Neg Pred Value : 0.9697
#> Prevalence : 0.4848
#> Detection Rate : 0.4697
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.9550
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE102287_lasso_min.png")
sigident::createROCplot_(roc = GSE102287_validation$lasso$min$roc,
filename = filename)

GSE102287_validation$lasso[["1se"]]$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 32 1
#> 1 2 31
#>
#> Accuracy : 0.9545
#> 95% CI : (0.8729, 0.9905)
#> No Information Rate : 0.5152
#> P-Value [Acc > NIR] : 3.899e-15
#>
#> Kappa : 0.9091
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9688
#> Specificity : 0.9412
#> Pos Pred Value : 0.9394
#> Neg Pred Value : 0.9697
#> Prevalence : 0.4848
#> Detection Rate : 0.4697
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.9550
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE102287_lasso_1se.png")
sigident::createROCplot_(roc = GSE102287_validation$lasso[["1se"]]$roc,
filename = filename)

GSE102287_validation$elasticnet$min$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 32 0
#> 1 2 32
#>
#> Accuracy : 0.9697
#> 95% CI : (0.8948, 0.9963)
#> No Information Rate : 0.5152
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9394
#>
#> Mcnemar's Test P-Value : 0.4795
#>
#> Sensitivity : 1.0000
#> Specificity : 0.9412
#> Pos Pred Value : 0.9412
#> Neg Pred Value : 1.0000
#> Prevalence : 0.4848
#> Detection Rate : 0.4848
#> Detection Prevalence : 0.5152
#> Balanced Accuracy : 0.9706
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE102287_elasticnet_min.png")
sigident::createROCplot_(roc = GSE102287_validation$elasticnet$min$roc,
filename = filename)

GSE102287_validation$elasticnet[["1se"]]$confmat
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 32 1
#> 1 2 31
#>
#> Accuracy : 0.9545
#> 95% CI : (0.8729, 0.9905)
#> No Information Rate : 0.5152
#> P-Value [Acc > NIR] : 3.899e-15
#>
#> Kappa : 0.9091
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9688
#> Specificity : 0.9412
#> Pos Pred Value : 0.9394
#> Neg Pred Value : 0.9697
#> Prevalence : 0.4848
#> Detection Rate : 0.4697
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.9550
#>
#> 'Positive' Class : 1

# create roc plot
filename <- paste0(plotdir, "ROC_validation_GSE102287_elasticnet_1se.png")
sigident::createROCplot_(roc = GSE102287_validation$elasticnet[["1se"]]$roc,
filename = filename)

For the calculation of a prognostic signature, it is first required to transform the phenoData containing the relevant survival data. In our example, only the study GSE19188 contains information on a subject's survival.
In order to make the subsequent analyses possible, the list discoverystudies.w.timedata, containing meta information about the discovery study, needs to be created. This nested list needs to contain the following elements:

- the studyname, which is the key for a list containing
  + setid: the index of the provided GEO set to be used
  + timecol: the phenoData column name of the survival time
  + status: a list containing the elements statuscol (the name of the phenoData column containing the survival status) and levels (a list containing the originally used level names of statuscol, namely alive, deceased and na)
  + targetcolname: the name of the phenoData column containing the target variable
  + controllevelname: the level name of the controls in the column specified in targetcolname
  + targetlevelname: the level name of the subjects of interest in targetcolname
  + use_rawdata: a logical value (TRUE or FALSE) indicating whether the study is imported from the raw data
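The status list is used to recode the character levels of statuscol into a numeric survival status. A hedged sketch of that logic (recode_status is a hypothetical helper, not a package function; the 1 = censored / 2 = event coding is an assumption following the survival package convention):

```r
# Hypothetical helper: map character status levels to numeric survival status
recode_status <- function(x, levels) {
  out <- rep(NA_integer_, length(x))
  out[x == levels$alive]    <- 1L  # censored
  out[x == levels$deceased] <- 2L  # event
  out                              # entries matching levels$na stay NA
}
recode_status(c("status: alive", "status: deceased", "status: Not available"),
              list(alive = "status: alive", deceased = "status: deceased",
                   na = "status: Not available"))
```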
Since we already used the raw data above to load the study ‘GSE19188’ for the DEG analyses, the enrichment analysis and the computation of the diagnostic signature, we again import the study from the raw CEL files here.
discoverystudies.w.timedata <- list(
"GSE19188" = list(setid = 1,
timecol = "characteristics_ch1.2",
status = list(statuscol = "characteristics_ch1.3",
levels = list(alive = "status: alive",
deceased = "status: deceased",
na = "status: Not available")),
targetcolname = "characteristics_ch1",
controllevelname = "tissue type: healthy",
targetlevelname = "tissue type: tumor",
use_rawdata = TRUE))

In order to build an unbiased classifier that separates subjects into high-risk and low-risk groups, expression data from the remaining discovery studies are required.
The classifier takes into account the expression profiles of the identified survival-correlated genes in tumor and non-tumor samples.
classifier_studies <- c("GSE18842", "GSE19804")
exprPattern <- sigident::generateExpressionPattern_(classifier_studies = classifier_studies,
sigCov = surv_correlated,
mergeset = mergeset,
studyMetadata = studyMetadata,
sampleMetadata = sampleMetadata,
controlname = controlname,
targetname = targetname,
targetcol = targetcol)

In order to apply the prognostic classifier to validation datasets, another list containing meta information about these validation studies is required. This list (validationstudiesinfo) has the same structure as the list discoverystudies.w.timedata described above.
However, the element use_rawdata is not allowed here. Furthermore, it is allowed to set targetlevelname to NULL, which indicates that all included samples belong to the class of interest and are taken into account when performing the analyses.
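The effect of targetlevelname = NULL can be sketched with a tiny hypothetical helper (filter_samples is not part of the package; it only illustrates the described logic):

```r
# Hypothetical helper: keep only target-level samples unless targetlevelname is NULL
filter_samples <- function(samples, levels, targetlevelname) {
  if (is.null(targetlevelname)) {
    return(samples)                # no filtering: use all samples
  }
  samples[levels == targetlevelname]
}
filter_samples(c("s1", "s2", "s3"), c("T", "N", "T"), "T")   # "s1" "s3"
filter_samples(c("s1", "s2", "s3"), c("T", "N", "T"), NULL)  # all three samples
```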
validationstudiesinfo <- list(
"GSE30219" = list(setid = 1,
timecol = "characteristics_ch1.9",
status = list(statuscol = "characteristics_ch1.8",
levels = list(alive = "status: ALIVE",
deceased = "status: DEAD",
na = "status: NTL")),
targetcolname = "source_name_ch1",
controllevelname = "Non Tumoral Lung",
targetlevelname = "Lung Tumour"),
"GSE50081" = list(setid = 1,
timecol = "characteristics_ch1.8",
status = list(statuscol = "characteristics_ch1.9",
levels = list(alive = "status: alive",
deceased = "status: dead",
na = NA)),
targetcolname = "characteristics_ch1.1",
targetlevelname = NULL,
controllevelname = NULL))
# if targetlevelname == NULL, the expression set will not be filtered further
# but instead, all samples will be used for calculations

Here we extract information on the study GSE30219.
dim(RiskTable)
#> [1] 293 3
head(RiskTable)
#> time status Groups
#> GSM748053 121 1 0
#> GSM748054 21 2 0
#> GSM748055 86 2 0
#> GSM748056 10 2 0
#> GSM748057 81 2 0
#> GSM748058 68 1 0

pC[["GSE30219"]]$kaplan.estimator$res.cox
#> Call:
#> survival::coxph(formula = survival::Surv(time, status) ~ Groups,
#> data = risktable)
#>
#> coef exp(coef) se(coef) z p
#> Groups 0.5491 1.7317 0.1507 3.643 0.000269
#>
#> Likelihood ratio test=13.68 on 1 df, p=0.0002166
#> n= 278, number of events= 188
#> (15 observations deleted due to missingness)

dim(RiskTable)
#> [1] 181 3
head(RiskTable)
#> time status Groups
#> GSM1213669 4.11 2 1
#> GSM1213670 7.52 1 0
#> GSM1213671 7.32 1 0
#> GSM1213672 6.84 1 0
#> GSM1213673 1.91 2 1
#> GSM1213674 7.03 1 1

pC[["GSE50081"]]$kaplan.estimator$res.cox
#> Call:
#> survival::coxph(formula = survival::Surv(time, status) ~ Groups,
#> data = risktable)
#>
#> coef exp(coef) se(coef) z p
#> Groups 0.5089 1.6635 0.2361 2.156 0.0311
#>
#> Likelihood ratio test=4.76 on 1 df, p=0.02919
#> n= 181, number of events= 75

In order to simplify the workflow described above, we wrapped all these functions into four wrapper functions.
The preprocessing of the data needs to conform to the approach described above.
# initialize filePath:
filePath <- tempdir()
maindir <- "./geodata/"
datadir <- paste0(maindir, "data/")
dir.create(maindir)
#> Warning in dir.create(maindir): './geodata' already exists
dir.create(datadir)
#> Warning in dir.create(datadir): './geodata/data' already exists
# initialize more infos on the study
targetcol <- "target" # should be named "target"
controlname <- "Control"
targetname <- "Lung Cancer"
# define more variables
plotdir <- "./plots/"
dir.create(plotdir)
#> Warning in dir.create(plotdir): './plots' already exists
csvdir <- "./csv/"
dir.create(csvdir)
#> Warning in dir.create(csvdir): './csv' already exists
# pathway
species <- "Hs"
OrgDb <- "org.Hs.eg.db"
organism <- "hsa"
pathwayid <- "hsa04110"
# diagnostic signature
seed <- 111
split <- 0.8
nfolds <- 10
# other variables
idtype <- "affy"
FDR <- 0.05

mergeset

sigidentDEG-function

sigidentEnrichment-function

sigidentDiagnostic-function

sigidentPrognostic-Function

discoverystudies.w.timedata <- list(
"GSE19188" = list(setid = 1,
timecol = "characteristics_ch1.2",
status = list(statuscol = "characteristics_ch1.3",
levels = list(alive = "status: alive",
deceased = "status: deceased",
na = "status: Not available")),
targetcolname = "characteristics_ch1",
controllevelname = "tissue type: healthy",
targetlevelname = "tissue type: tumor",
use_rawdata = TRUE))
validationstudiesinfo <- list(
"GSE30219" = list(setid = 1,
timecol = "characteristics_ch1.9",
status = list(statuscol = "characteristics_ch1.8",
levels = list(alive = "status: ALIVE",
deceased = "status: DEAD",
na = "status: NTL")),
targetcolname = "source_name_ch1",
controllevelname = "Non Tumoral Lung",
targetlevelname = "Lung Tumour"),
"GSE50081" = list(setid = 1,
timecol = "characteristics_ch1.8",
status = list(statuscol = "characteristics_ch1.9",
levels = list(alive = "status: alive",
deceased = "status: dead",
na = NA)),
targetcolname = "characteristics_ch1.1",
targetlevelname = NULL,
controllevelname = NULL))
# if targetlevelname == NULL, the expression set will not be filtered further
# but instead, all samples will be used for calculations

progn_results <- sigident::sigidentPrognostic(mergeset = mergeset,
studyMetadata = studyMetadata,
sampleMetadata = sampleMetadata,
idtype = idtype,
genes = genes,
discoverystudies.w.timedata = discoverystudies.w.timedata,
classifier_studies = c("GSE18842", "GSE19804"),
validationstudiesinfo = validationstudiesinfo,
targetname = targetname,
controlname = controlname,
datadir = datadir,
plotdir = plotdir,
csvdir = csvdir,
targetcol = targetcol)